Mining Databases across Multiple Tables
Author
Abstract
Project Summary
Current mining technology typically applies to centrally stored data (i.e., data held in one single repository, with central administration, etc.). However, real-life datasets are often decentralized (i.e., they consist of several tables, perhaps obtained via normalization or partitioning / allocation, and stored in several repositories). The goal of this research project is to develop mining techniques for decentralized data. The key idea is that, in contrast to traditional techniques (where the data is first joined to form a single table), the decentralized approach concurrently generates partial results on the separate tables and then uses the foreign key relationships to merge these results. A similar approach is examined for classification in decentralized datasets; the techniques chosen are those most amenable to decentralization, or those indicated by the applications. The techniques are assessed by efficiency analysis and empirically validated on available synthetic and real datasets. The effects of different database design choices on the decentralized mining algorithms are also considered. Systematic techniques are developed that use the details of the database design (e.g., normalization and partitioning / allocation information), together with the catalog statistics, to optimize for efficient execution. This research also benefits educational activities by providing research experience for the graduate students involved. Furthermore, the techniques can be applied to mining distributed relational metadata for public information repositories; e.g., they can support programming advanced applications, such as information mining, over the referenced datasets, initially for datasets used by students and thereafter for users of any Web-based datasets.
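The key idea of the summary (compute partial results per table, then merge them through the foreign key instead of materializing the join) can be illustrated with a minimal sketch. This is not the project's algorithm; the toy tables, the item encoding, and the function names (`fk_weights`, `support_without_join`, `support_with_join`) are all hypothetical and chosen only for illustration. The sketch counts the support of an itemset whose items all live in one table by weighting each row with the number of rows in the other table that reference it.

```python
from collections import Counter

# Hypothetical toy data: one table keyed by primary key, and a second
# table whose rows reference it through a foreign key.
customers = {  # primary key -> set of attribute items
    1: {"city=Paris", "tier=gold"},
    2: {"city=Paris", "tier=basic"},
    3: {"city=Rome", "tier=gold"},
}
orders = [  # (foreign key into customers, purchased item)
    (1, "book"), (1, "pen"), (2, "book"), (3, "book"), (3, "book"),
]


def fk_weights(fact_rows):
    """Partial result on the referencing table: rows per foreign key value."""
    return Counter(fk for fk, _ in fact_rows)


def support_without_join(itemset, dim_table, weights):
    """Merge step: support of the itemset in the join, without building it.

    Each row of dim_table that contains the itemset contributes as many
    joined rows as there are referencing rows (its foreign key weight).
    """
    return sum(w for key, w in weights.items() if itemset <= dim_table[key])


def support_with_join(itemset, dim_table, fact_rows):
    """Reference computation: walk the joined rows one by one and count."""
    return sum(1 for fk, _ in fact_rows if itemset <= dim_table[fk])


weights = fk_weights(orders)
gold = frozenset({"tier=gold"})
decentralized = support_without_join(gold, customers, weights)
centralized = support_with_join(gold, customers, orders)
print(decentralized, centralized)
```

Both functions return the same count here; the difference is that the decentralized version scans each table once on its own and only combines small summaries, which is the kind of saving the summary describes. Itemsets that span both tables need a more involved merge and are outside this sketch.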
Similar resources
Exploring multi-dimensional sequential patterns across multi-dimensional multi-sequence databases
Existing multi-dimensional sequential pattern mining methods only discover multi-dimensional sequential patterns in databases involving one sequential dimension. Since multi-dimensional sequential patterns may exist in databases containing more than one sequential dimension, in this paper, we present algorithm PSeq-MIDim for mining multi-dimensional sequential patterns from multiple sequential d...
Join Bayes Nets: A New Type of Bayes net for Relational Data
Many real-world data are maintained in relational format, with different tables storing information about entities and their links or relationships. The structure (schema) of the database is essentially that of a logical language, with variables ranging over individual entities and predicates for relationships and attributes. Our work combines the graphical structure of Bayes nets with the logi...
An Efficient Multi-relational Naïve Bayesian Classifier Based on Semantic Relationship Graph
Classification is one of the most popular data mining tasks with a wide range of applications, and lots of algorithms have been proposed to build accurate and scalable classifiers. Most of these algorithms only take a single table as input, whereas in the real world most data are stored in multiple tables and managed by relational database systems. As transferring data from multiple tables into...
Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation
The growing expanse of e-commerce and the widespread availability of online databases raise many fears regarding loss of privacy and many statistical challenges. Even with encryption and other nominal forms of protection for individual databases, we still need to protect against the violation of privacy through linkages across multiple databases. These issues parallel those that have arisen and...
Visually Mining on Multiple Relational Tables at Once
Data mining (DM) processes require data to be supplied in only one table or data file. Therefore, data stored in multiple relations of relational databases must be joined before submission to DM analysis. A problem faced during this preparation step is that, most of the time, the analyst does not have a clear idea of what portions of data should be mined. This paper reckons the strong human ab...